
Aligning Text to Image in Diffusion Models is Easier Than You Think

Neural Information Processing Systems

While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations remains. Some approaches address this issue by fine-tuning models with preference optimization and similar techniques, which require tailored datasets. Orthogonal to these methods, we revisit the challenge from the perspective of representation alignment, an approach that has gained popularity with the success of REPresentation Alignment (REPA) [46]. We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, are suboptimal from the standpoint of representation alignment.


A.1 Qualitative Results of Bench

Neural Information Processing Systems

Figure 5: Word clouds of text prompts for the text-only generation (T2I) task (left) and the multimodal generation task (right).

Figure 5 visually summarizes the prominent semantic elements in the benchmark prompts for text-only (T2I) and multimodal generation tasks. The difference between the word clouds reflects task-specific features of MMGen-Bench: T2I prompts emphasize spatial and descriptive details, while multimodal prompts more frequently involve social and interactive scenarios.

Aspect       Objects  Relations  Attributes  Counting  Overall
Spearman ω   0.469    0.909      0.601       0.839     0.699

As depicted in Figure 6, the distribution of aspect types differs notably between the text-only generation (T2I) and multimodal generation tasks. In the T2I setting, "Objects" dominate with 38.3%, while "Attributes" and "Relations" also constitute substantial proportions (33.9% and 25.4%, respectively).


REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints

Neural Information Processing Systems

Articulated objects are prevalent in human life, and their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling realistic surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of two arbitrary states of an articulated object, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate that our approach achieves high-quality textured surface reconstruction for given states and enables high-fidelity surface generation for unseen states.


VisDiff: SDF-Guided Polygon Generation for Visibility Reconstruction, Characterization and Recognition

Neural Information Processing Systems

The ability to capture rich representations of combinatorial structures has enabled the application of machine learning to tasks such as analysis and generation of floorplans, terrains, images, and animations. Recent work has primarily focused on understanding structures with well-defined features, neighborhoods, or underlying distance metrics, while those lacking such characteristics remain largely unstudied. Examples of these combinatorial structures can be found in polygons, where a small change in the vertex locations causes a significant rearrangement of the combinatorial structure, expressed as visibility or triangulation graphs. Current representation learning approaches fail to capture structures without well-defined features and distance metrics. In this paper, we study the open problem of Visibility Reconstruction: given a visibility graph G, construct a polygon P whose visibility graph is G.


OV-PARTS: Towards Open-Vocabulary Part Segmentation (Supplementary Material)

Neural Information Processing Systems

The supplementary material is organized as follows: Implementation Details (Sec. …). Except for the Object Mask Prompt and Compositional Prompt Tuning designs, we follow the default architecture in the original ZSseg paper. The number of part queries is set to 50. All two-stage baselines are trained with the AdamW optimizer, an initial learning rate of 1e-4, and a weight decay of 1e-4. A poly learning rate policy with a power of 0.9 is adopted.
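The poly learning rate policy mentioned above is commonly written as base_lr * (1 - step / max_steps) ** power. The following is a minimal sketch of that common form, not the authors' training code: the function name `poly_lr` and the value `max_steps=1000` are hypothetical, while the initial learning rate of 1e-4 and the power of 0.9 come from the text.

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """Polynomial ("poly") decay: base_lr * (1 - step / max_steps) ** power."""
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power


# Hypothetical schedule length; the source does not state the number of iterations.
MAX_STEPS = 1000

# Learning rate at the start, midpoint, and end of training.
schedule = [poly_lr(1e-4, s, MAX_STEPS) for s in (0, 500, MAX_STEPS)]
```

With a power below 1, the rate decays slightly more slowly than a linear schedule early on and reaches zero at the final step.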



4b6538a44a1dfdc2b83477cd76dee98e-Supplemental.pdf

Neural Information Processing Systems

In this document, we provide more implementation details of CATs and more results on SPair-71k [16], PF-PASCAL [4], and PF-WILLOW [3]. Given resized input images $I_s, I_t \in \mathbb{R}^{256 \times 256 \times 3}$, we conducted experiments using different feature backbone networks, including DeiT-B [22], DINO [2], and ResNet-101 [5]. For the ResNet-101multi in the paper, we use the best layer subset [15] of (0,8,20,21,26,28,29,30) for SPair-71k, and (2,17,21,22,25,26,28) for PF-PASCAL and PF-WILLOW. We resized the spatial resolution of extracted feature maps to 16 × 16. The extracted features undergo L2 normalization, and the correlation maps are constructed using dot products.
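The last step above (L2 normalization followed by dot products) can be sketched as follows. This is an illustrative toy version using plain Python lists rather than the authors' implementation: the function names `l2_normalize` and `correlation_map`, the `eps` stabilizer, and the tiny two-position inputs are all assumptions. In the setting described, each feature map would contribute 16 × 16 = 256 position vectors.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a feature vector to (approximately) unit L2 norm; eps avoids division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def correlation_map(src_feats, tgt_feats):
    """Dot products between every normalized source and target position.

    src_feats, tgt_feats: lists of per-position feature vectors.
    Returns a nested list of shape (len(src_feats), len(tgt_feats)).
    """
    src = [l2_normalize(v) for v in src_feats]
    tgt = [l2_normalize(v) for v in tgt_feats]
    return [[sum(a * b for a, b in zip(s, t)) for t in tgt] for s in src]


# Toy inputs: two spatial positions, two channels each (hypothetical values).
src_feats = [[3.0, 4.0], [1.0, 0.0]]
tgt_feats = [[6.0, 8.0], [0.0, 2.0]]
corr = correlation_map(src_feats, tgt_feats)
```

Because the vectors are normalized first, each entry is a cosine similarity in [-1, 1], so parallel features such as (3, 4) and (6, 8) score close to 1 regardless of their magnitude.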



Supplementary Materials: An Empirical Study of Adder Neural Networks for Object Detection

Neural Information Processing Systems

As discussed in prior literature [1, 4], a single floating-point addition and a single floating-point multiplication have energy costs of 0.9 pJ and 3.7 pJ, respectively. Meanwhile, an 8-bit integer addition and multiplication cost 0.03 pJ and 0.2 pJ, respectively, far less than their floating-point counterparts. Therefore, it is important to explore whether adder detectors perform well under INT8 quantization. We applied INT8 post-training quantization to our Adder FCOS (B+N) model, which suffers a 0.8 mAP drop compared with the full-precision model, as shown in Table A. The energy reduction further increases from 29% to 35%. Note that post-training quantization is not optimal for INT8 models, and quantization-aware training may further improve the accuracy considerably.
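To make the per-operation comparison concrete, the short sketch below computes how much cheaper each INT8 operation is than its floating-point counterpart, using only the four energy figures quoted above. The dictionary layout and the helper name `energy_ratio` are hypothetical, and the calculation covers single operations only, not the whole-detector energy figures reported in the text.

```python
# Per-operation energy costs in picojoules, as quoted in the text.
ENERGY_PJ = {
    ("float", "add"): 0.9,
    ("float", "mul"): 3.7,
    ("int8", "add"): 0.03,
    ("int8", "mul"): 0.2,
}


def energy_ratio(op: str) -> float:
    """How many times more energy a floating-point op costs than the INT8 op."""
    return ENERGY_PJ[("float", op)] / ENERGY_PJ[("int8", op)]


add_ratio = energy_ratio("add")  # roughly 30x
mul_ratio = energy_ratio("mul")  # roughly 18.5x
```

Under these figures, an INT8 addition is about 30 times cheaper than a floating-point addition, and an INT8 multiplication about 18.5 times cheaper than a floating-point multiplication.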